ABSTRACT Based on the observation that
a single mutational event can delete or insert 
multiple residues, affine gap costs for sequence 
alignment charge a penalty for the existence of 
a gap, and a further length-dependent penalty. 
From structural or multiple alignments of dis- 
tantly related proteins, it has been observed 
that conserved residues frequently fall into 
ungapped blocks separated by relatively non- 
conserved regions. To take advantage of this 
structure, a simple generalization of affine gap 
costs is proposed that allows nonconserved 
regions to be effectively ignored. The distribu- 
tion of scores from local alignments using these 
generalized gap costs is shown empirically to 
follow an extreme value distribution. Examples 
are presented for which generalized affine gap 
costs yield superior alignments from the standpoints
both of statistical significance and of
alignment accuracy. Guidelines for selecting 
generalized affine gap costs are discussed, as is 
their possible application to multiple align- 
ment. Proteins 32:88-96, 1998. 
INTRODUCTION

The comparison of two protein or DNA sequences
generally is guided by a similarity function that 
assigns a score to all possible alignments. The score 
for a given alignment is most often taken to be the 
sum of "substitution scores" for aligning pairs of 
residues, and "gap scores" for aligning strings of 
residues in one sequence with null characters intro- 
duced into the other. The gap scores in earliest 
common use charged a fixed penalty for each residue 
in either sequence aligned with a null in the other. 
Because under this system the cost of a gap is 
proportional to its length, we call these length- 
proportional gap costs. Using these costs, algorithms 
for constructing optimal global or local alignments 
require O(mn) time, where m and n are the lengths
of the sequences being compared. 1-5 

Over the years it was observed that the optimal, or 
highest-scoring, alignments produced by length-proportional
gap costs often invoked a large number
of short insertions or deletions and were not biologically
plausible. That a single mutational event might
insert or delete a large number of residues suggested
that a long gap should not cost substantially more 
than a short one. The simplest way to capture this 
idea is to charge a gap of length k the cost a + bk: the
existence of a gap costs a, and each residue aligned 
with a null costs b. In certain cases where the 
biologically correct alignment is known, the use of 
such "affine" in place of length-proportional gap costs
has been shown to be necessary if the true alignment
is to be the highest-scoring one. 6 Fortunately, algo- 
rithms for the construction of optimal alignments 
using affine gap costs are only slightly more compli- 
cated than those required for length-proportional 
gap costs, and require only a constant factor more 
space and time. 7-9 It is possible, of course, to define
more complicated gap costs, for example as an 
arbitrary function of gap length. 10 For the class of 
"concave" gap costs, optimal alignment algorithms 
may still be constructed that require only O(mn)
time. 11 However, these algorithms are substantially 
more difficult to implement and almost all alignment 
programs in popular use have confined themselves to 
affine gap costs. 

Many methods for multiple alignment, structural 
alignment, and sequence-structure threading have 
been developed. When comparing two protein struc- 
tures, it is often apparent that secondary-structural 
elements may be superimposed closely in space, 
while the loops that connect them remain difficult if
not impossible to align. Accordingly, many measures
of the quality of a structural or threading alignment 
essentially ignore these intervening loops. 12-15 Similarly,
many approaches to local multiple alignment
confine themselves to seeking ungapped blocks of
aligned residues separated by regions of variable
length that are left unaligned. 16-23 One widely used
database of protein motifs is constructed of just such 
ungapped blocks. 24 A possible view is that such 
constraint is imposed only for algorithmic reasons:
that fully aligning the regions between two blocks
would simply be too time-consuming, and accordingly
is omitted. Often this perception may be correct,
but as the structural alignment example shows,
it is frequently more accurate to claim that the
segments separating two conserved regions should
not be aligned than to impose an alignment upon
them. If this is true for structural and multiple
alignments, does it have any relevance for simple
pairwise alignment?

*Correspondence to: Stephen F. Altschul, National Center for
Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, MD 20894. E-mail:
altschul@ncbi.nlm.nih.gov

Received 5 February 1997; Accepted 22 January 1998

One original motivation for pairwise sequence 
alignment was the reconstruction of molecular evolution. 3,25,26
Confining attention to substitution, insertion,
and deletion mutations, one can claim that two
homologous sequences have a historically correct
alignment, which it is the goal of sequence comparison
to approximate as well as possible. However, for
distantly related proteins, an alternative viewpoint 
may emerge. Some protein regions are under greater 
structural constraint than others, and therefore 
evolve more slowly. As a result, two proteins may 
share several regions with recognizable similarity, 
separated by regions bearing no detectable mutual 
relationship. At the structural level, this lack of 
similarity may actually reflect a loss of three- 
dimensional correspondence. Even from an evolution- 
ary perspective, while it may still make sense to 
align these regions, the requisite information for 
doing so simply may have been lost. 

This article introduces a generalization of affine 
gap costs, applicable to both global and local pair- 
wise alignments, that within a larger alignment 
permits apparently unrelated sequence regions to 
remain unaligned. For local alignments, the distribution
of optimal alignment scores is shown empirically
to follow an extreme value distribution. The
relevant statistical parameters may be estimated for 
different gap cost settings. The effectiveness of a 
given alignment scoring system may be measured 
both by the degree to which it yields statistically 
significant scores for related sequences, and by the 
degree to which the optimal alignments it generates 
conform to biological reality. In many cases, general- 
ized affine gap costs prove superior to traditional 
costs by both of these criteria. Empirical studies can 
guide gap cost selection, 27 but general considerations
regarding features of the alignments sought also can
inform the choice. Generalized affine gap costs may
be introduced into applications that employ pairwise
sequence alignment, such as progressive multiple 
alignment. 

GAPS AND THEIR ASSOCIATED COSTS 

Traditionally, a gap within a pairwise alignment is 
defined to consist of k residues from a single se- 
quence, and affine gap costs assign it a score of 
$-(a + bk)$. We generalize the notion of gap to
involve $k_1$ residues from sequence A and $k_2$ residues
from sequence B. One can assign a cost to such a gap
in many different ways. Perhaps the simplest extension
of affine gap costs gives this gap a score of
$-[a + b(k_1 + k_2)]$. 12 However, this definition is
indifferent between, say, 30 residues gapped out of a
single sequence and 15 residues left unaligned in
each of the two sequences. From structural and
perhaps even evolutionary considerations, one may
wish to prefer the latter case. Accordingly, we introduce
a three-parameter generalization of affine gap
costs, in which the score $-a$ is assessed for the
existence of a gap, $-b$ for each residue inserted or
deleted, and $-c$ for each pair of residues left unaligned.
More formally, the score for a gap involving
$k_1$ and $k_2$ residues, with $k_1 \geq k_2$, is
$-[a + b(k_1 - k_2) + c k_2]$. We will represent these generalized
affine gap costs by the ordered triple (a, b, c). When
$c = \infty$, these costs reduce to traditional affine gap
costs, and when $c = 2b$, they reduce to those proposed
by Zuker and Somorjai. 12 Note that we have
adopted a different parameter-naming convention
than on occasion is used elsewhere. 27,28 Specifically,
the gap opening score a is sometimes taken to
include the score for the first inserted or deleted
residue, while here it is not.
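
The three-parameter definition above can be written as a small function. The following Python sketch is illustrative only: the name `generalized_gap_cost` is hypothetical and not from the article, and costs are returned as positive numbers whose negative is the gap score.

```python
import math


def generalized_gap_cost(k1, k2, a, b, c):
    """Cost of one gap leaving k1 and k2 residues of the two sequences unaligned.

    Implements a + b*(k1 - k2) + c*k2 for k1 >= k2, swapping the counts if needed.
    """
    longer, shorter = max(k1, k2), min(k1, k2)
    # Guard the product so that c = infinity adds nothing when no pairs are unaligned.
    pair_cost = c * shorter if shorter > 0 else 0
    return a + b * (longer - shorter) + pair_cost


# 30 residues gapped out of one sequence versus 15 left unaligned in each.
one_sided = generalized_gap_cost(30, 0, 120, 10, 3)    # 120 + 10*30
two_sided = generalized_gap_cost(15, 15, 120, 10, 3)   # 120 + 3*15

# c = 2b recovers a + b*(k1 + k2); c = infinity forbids unaligned pairs.
symmetric = generalized_gap_cost(15, 15, 120, 10, 20)
forbidden = generalized_gap_cost(15, 15, 120, 10, math.inf)
```

Under (120, 10, 3) costs the two-sided gap is much cheaper than the one-sided gap of the same total length, which is the preference the definition is meant to express.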

Generalized affine gap costs may be used in either 
the global or local alignment context. For global 
alignments, one has as always the choice of whether 
to score end gaps differently than internal gaps. 
Standard dynamic programming algorithms for ei- 
ther global or local alignment can easily accommo- 
date generalized affine gap costs, in an analogous 
manner to their treatment of traditional affine gap 
costs. The main difference is that once a gap has been
opened, diagonal moves within the path graph are
permitted, with a score $-c$, in addition to vertical or
horizontal moves with a score $-b$ (Fig. 1). Implementations
that return the optimal alignment score and
a representation of all optimal alignments require
O(mn) time and space. 7,8 If only a single optimal
alignment is required, the space requirement can be
reduced to O(min(m, n)). 9 The details of these algorithms
are easily reconstructed, and will be omitted
here. 

Fig. 1. A schematic representation of how scores are assessed
within a path graph when generalized affine gap costs are employed.
The score $-a$ is charged for the existence of a gap; $-b$ for each
unpaired residue left unaligned; and $-c$ for each pair of residues left
unaligned. The solid diagonal line represents a pair of aligned residues;
the dotted diagonal line, a pair of unaligned residues; and the dotted
horizontal and vertical lines, single unaligned residues.

LOCAL ALIGNMENT STATISTICS

Little is known concerning the distribution of
optimal global alignment scores from the pairwise
comparison of random sequences. In contrast, the
random distribution of optimal local alignment scores
is quite well understood. The prototypical case is
that of local alignments in which gaps are forbidden;
the scores of such alignments have been shown
analytically to follow an extreme value distribution. 29,30
Given a matrix of substitution scores $s_{ij}$ for
aligning pairs of residues, and background probabilities
$p_i$ for the occurrence of residues within the
sequences, the values of two key parameters, $\lambda$ and
$K$, may be calculated. (The expected score $\sum_{i,j} p_i p_j s_{ij}$
for aligning two random residues must be negative
for the theory to hold.) Using these parameters, the
raw score $S$ of the optimal local alignment may be
converted to a normalized score $S'$ by the formula

$$S' = \frac{\lambda S - \ln K}{\ln 2} \qquad (1)$$

Such a normalized score $S'$ is said to be expressed in
bits. The expected number of distinct segment pairs
with normalized score greater than or equal to $x$ is
then well approximated by the formula

$$E(S' \geq x) \approx \frac{N}{2^x} \qquad (2)$$

where the search-space size N is the product of the
lengths of the sequences being compared. 29,30
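
Equations (1) and (2) can be sketched in a few lines of Python. The function names are hypothetical. The parameter values are the estimates the article reports later for (120, 10, 3) gap costs, and the raw score is the one quoted for the alignment of Figure 4b. The search space used for the E value is only an example.

```python
import math


def bit_score(raw_score, lam, K):
    """Equation (1): normalized score S' = (lambda*S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_count(bits, search_space):
    """Equation (2): expected number of segment pairs scoring at least `bits`."""
    return search_space / 2 ** bits


# Reported estimates for (120, 10, 3) gap costs and the raw score of Fig. 4b.
bits = bit_score(822, lam=0.0286, K=0.041)

# Example search space: two sequences of length 1,000, so N = 1,000 * 1,000.
e_value = expected_count(bits, 1000 * 1000)
```

With these inputs the normalized score comes out at about 38.5 bits, in agreement with the value quoted for Figure 4b.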

Once gaps and their associated costs are allowed 
within local alignments, the statistical theory out- 
lined above is no longer known to hold. However, 
some theory 31 and many computational experiments 28,32,33
strongly suggest that it does. The only
practical difference is that one may no longer calculate
$\lambda$ and $K$ analytically. Instead, they must be
estimated by either random simulation or the comparison
of real but unrelated sequences. 28,32-35

All statistical studies of gapped local alignments
to date, of course, have employed at most affine gap
costs. While it appears likely that the same statistical
theory will apply to the scores of alignments
generated using the generalized affine gap costs
introduced here, it is nevertheless desirable to adduce
some empirical support. Accordingly, we generated
24,000 pairs of length 1,000 random protein
sequences, using the background amino acid frequencies
of Robinson and Robinson. 36 Each pair was
compared using a scaled version (Fig. 2) of the
BLOSUM-62 amino acid substitution matrix, 37 and
(120, 10, 3) generalized affine gap costs. A histogram
of the 24,000 optimal local alignment scores produced
is shown in Figure 3. The best fit of an extreme
value distribution 38 to these data was estimated by
the maximum likelihood method, 39 and the resulting
curve is shown in Figure 3. A $\chi^2$ goodness-of-fit test,
with 275 degrees of freedom, had the value 290.1; a
worse fit would be expected 27% of the time even
were the extreme value theory precisely valid. As for
traditional affine gap costs, 28 we have performed
more extensive tests on generalized affine gap
costs (data not shown) to establish that they conform to
other aspects of the basic statistical theory. 29,30

To employ Equations (1) and (2), all that is needed
are estimates of $\lambda$ and $K$. For any set of gap costs we
consider, these parameters were estimated as described
above. The standard error for the resulting
estimate of $\lambda$ was approximately 0.5%, and for $K$
approximately 5%. However, the method for estimating
these parameters 39 has the effect of making their
errors approximately proportional. As a result, the
standard error for normalized scores in the range of
40 bits is about 0.1 bits. In addition to being subject
to stochastic error, the parameter estimates are of
course dependent on the particular random protein
model used. With $\lambda$ and $K$ in hand, Equation (1)
converts raw scores into normalized scores, expressed
in bits. This normalization permits the
alignment scores generated by different substitution
matrices and gap costs to be directly compared. 40,41

BIOLOGICAL EXAMPLES 

For sequences that are closely related, generalized
affine gap costs will provide no advantage over traditional
gap costs, because related regions will not be interrupted
by regions without detectable similarity. To study
whether generalized gap costs can improve the detection
of weak relationships, we used an appropriately
modified version of the Smith-Waterman algorithm 4 to
search release 34 plus updates of the SWISS-PROT
database 42 with 11 protein queries. For homologous
database sequences that barely attained statistical significance,
we compared the scores returned by generalized
and traditional gap costs. The results are shown in
Table I.

Fig. 2. A scaled version of the BLOSUM-62 amino acid
substitution matrix. 37 Because we wish to consider gap costs that
would be fractional in the usual units in which that matrix is
expressed, and so that we may continue to deal in integers, we
have multiplied the standard matrix by 10. Since the matrix was
originally constructed by rounding real numbers to the nearest
integer, we have returned to the raw data to gain precision.

For our database searches, we used a version of
the BLOSUM-62 amino acid substitution matrix 37
(Fig. 2), scaled by a factor of 10 so that gap scores
could be kept integral. To select reasonable gap costs
for general-purpose sequence comparison, there is
little substitute for empiricism. An exhaustive evaluation
of gap cost parameter space for the most
sensitive parameter settings is beyond the scope of
this article. However, we have found that in conjunction
with our scaled BLOSUM-62 matrix, (120, 10, 3)
gap costs prove generally effective; the corresponding
statistical parameters are estimated at $\lambda \approx 0.0286$
and $K \approx 0.041$. For the purposes of comparison,
it is appropriate to select a set of traditional
affine gap costs ($c = \infty$) with nearly identical $\lambda$. This
renders raw alignment scores nearly comparable,
allowing the comparison to hinge almost completely
on the differential scoring of gaps. Accordingly, we
kept the b gap cost parameter fixed at 10, and
lowered the penalty a for the existence of a gap from
120 to 97; the statistical parameters for (97, 10, $\infty$)
gap costs were then estimated to be $\lambda \approx 0.0286$ and
$K \approx 0.046$. Are these a reasonable set of traditional
gap costs to employ? Pearson 27 has conducted performance
tests on a large variety of search algorithms,
substitution matrices, and traditional affine gap
costs. In his nomenclature our scoring system would
correspond roughly to (11, 1) gap costs used in conjunction
with the standard BLOSUM-62 matrix.
While Pearson offers no single prescription of scoring
system for database searching, this one at least falls
within the set of reasonable choices.
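
The correspondence with (11, 1) costs can be checked with a short calculation. This Python sketch assumes, following the article's earlier remark on naming conventions, that the other nomenclature quotes the cost of the first gapped residue (existence plus one residue) followed by the cost of each further residue, in units of the unscaled matrix. The function name is hypothetical.

```python
def to_first_residue_convention(a, b, scale=10):
    """Convert (existence, per-residue) costs on the scaled matrix to
    (first-residue, each-additional-residue) costs in unscaled matrix units."""
    first_residue = (a + b) / scale   # opening a gap always gaps one residue
    extension = b / scale
    return first_residue, extension


first, ext = to_first_residue_convention(97, 10)   # (10.7, 1.0), roughly (11, 1)
```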

To focus on distantly diverged but homologous
sequences, we analyzed only alignments that appeared
moderately significant ($0.1 > E > 0.0001$) using
at least one of our two sets of gap costs. For our 11
queries, the number of alignments satisfying this
condition ranged from 1 to 71. SWISS-PROT annotation
suggested that all such alignments represented
biologically meaningful relationships, with the exception
of one returned by the histocompatibility antigen
query. This single false positive received a score
greater by 0.9 bits using the traditional gap costs. As
shown in Table I, for eight of the 11 queries the mean
normalized score using (120, 10, 3) gap costs was
higher than that with (97, 10, $\infty$) gap costs. Averaged
over queries, the mean score differential was 0.6 bits,
corresponding to a factor of 1.5 in statistical significance.
The optimal score for most database sequences
is not greatly affected by the use of one set of
gap costs or the other. Nevertheless, for eight queries
(vs. two for traditional gap costs), the use of
generalized gap costs improved
at least one alignment score by more than
three bits, enough to affect materially the ability to
recognize a similarity as statistically significant.

Fig. 3. A histogram of the optimal local alignment scores of
24,000 pairs of random sequences of length 1,000, generated
using the amino acid frequencies of Robinson and Robinson. 36
Scores were calculated using the BLOSUM-62 substitution matrix
of Figure 2, and (120, 10, 3) gap costs. The superimposed
extreme value distribution 38 was calculated to fit the data by the
method of maximum likelihood. 39 A $\chi^2$ goodness-of-fit test, with
275 degrees of freedom, has the value 290.1.

One may inquire into not only the score of a
sequence similarity, but also the accuracy of the
alignment to which it corresponds, as measured by
some gold standard, such as the alignment's congruence
with a multiple or a structural alignment. It is
not clear how one is best to construct such an
objective standard. We took the relatively straightforward
approach of applying one iteration of the
PSI-BLAST program 43 to each of our queries. This
program constructs a multiple alignment from the
significant alignments returned by an initial database
search, and then uses a position-specific score
matrix derived from this alignment to perform a subsequent
search. So that reasonable credence could be
given to the alignments used to construct PSI-BLAST's
score matrix, we employed a stringent initial cutoff E
value of $10^{-10}$. Also, we ran PSI-BLAST using traditional
gap costs, which should tend to bias the alignments
it returns in favor of the pairwise alignments
produced by the same costs. Of the 22 similarities of
Table I for which generalized gap costs produced substantially
greater scores, 19 (120, 10, 3)-alignments conformed
better than did (97, 10, $\infty$)-alignments to the
corresponding PSI-BLAST alignments, two equivalently,
and one worse (alignment results not shown).
Conversely, of the four similarities for which traditional
gap costs produced substantially greater scores, one
(97, 10, $\infty$)-alignment conformed better to the corresponding
PSI-BLAST alignment, two equivalently, and one
worse. This asymmetrical result suggests that generalized
affine gap costs, in addition to returning higher
scores for moderately similar sequences, also tend to
produce alignments that conform better to biological
reality.

TABLE I. Relative Sensitivity of Traditional and Generalized Affine Gap Costs*

*Using a generalization of the Smith-Waterman algorithm, 4 all queries were compared to release 34 plus updates of the SWISS-PROT
database 42 (68,619 sequences; 24,728,649 amino acids). Alignment scores were derived from the scaled BLOSUM-62 matrix of Figure
2 and both traditional (97, 10, $\infty$) and generalized (120, 10, 3) affine gap costs. E values were calculated for both sets of gap costs using
Equations (1) and (2). An edge-effect correction 28 for search-space size was employed, based on a calculated relative entropy 41 of 0.65
bits for ungapped alignments. Moderately significant alignments are defined as those whose smaller E value is <0.1 and whose larger
E value is >0.0001.

To illustrate the potential utility of generalized
affine gap costs, we consider the conserved domain
that is shared by the BRCA1 protein, 44 the human
p53-binding protein 53BP1, 45 and many other human,
yeast, and even bacterial proteins involved in
cell cycle checkpoints. 46-48 When the 202-residue,
putatively globular, C-terminal domain of BRCA1 was
used as the query in a database search, the original clue to
the existence of this superfamily was an alignment
with 53BP1. With the default scoring system provided
by the BLAST program, 49 the alignment in
isolation was not statistically significant, and only
subsequent motif searches and multiple alignments
established the relationship. 46 (For the original database
search performed, the search-space size was
approximately $1.2 \times 10^{10}$, implying that a normalized
score of 37.8 bits was necessary for statistical
significance.) Here, we compare the C-terminal domain
of BRCA1 with 53BP1 using both sets of gap
costs described above. The optimal score yielded by
(97, 10, $\infty$) gap costs is 34.3 bits, while that yielded by
(120, 10, 3) gap costs is 38.5 bits; alignments achieving
these scores are shown in Figure 4A and B. The
score of the latter result is greater by 4.2 bits,
corresponding to a factor of 18 in statistical significance.
Furthermore, the alignment of Figure 4B
nearly agrees with that implied by the multiple
alignment of Koonin et al., 46 while the alignment of
Figure 4A diverges substantially (and presumably
inaccurately) over its central region. This poorly
conserved region is left substantially unaligned by
the generalized affine gap costs; notice that one pair
of segments remains unaligned in Figure 4B even
though alignment could be imposed without introducing
null characters into either sequence. Tellingly,
once other sequences are added to the alignment,
these segments span a region into which gaps must
be introduced. 46
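
The figures quoted in this example follow from Equation (2), as the Python sketch below illustrates. The function names are hypothetical. The significance threshold of E = 0.05 is not stated in the text; it is an assumption chosen because it reproduces the quoted 37.8 bits.

```python
import math


def bits_needed(search_space, e_threshold):
    """Normalized score x at which Equation (2), N / 2**x, equals e_threshold."""
    return math.log2(search_space / e_threshold)


def significance_factor(bit_difference):
    """Ratio of E values for two scores that differ by `bit_difference` bits."""
    return 2 ** bit_difference


# Assumed threshold E = 0.05 with the search-space size given in the text.
threshold = bits_needed(1.2e10, e_threshold=0.05)

# Difference between the (120, 10, 3) and (97, 10, infinity) scores.
factor = significance_factor(38.5 - 34.3)
```

The same conversion applied to the mean differential of 0.6 bits reported for Table I gives a factor of about 1.5.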

FURTHER THOUGHTS ON GAP COST SELECTION

Because for a given set of substitution costs it has
not been easy to define the optimal gap costs, one
approach that has been advocated is to try them all.
It can be shown that the space defined by the gap
cost parameters may be divided systematically into
regions in which the same alignments are optimal.
Parametric alignment programs that perform such a
dissection of parameter space have been described
and made available. 51,52 One problem with this approach
is that it generates a potentially very large
number of alignments, with no guidance for choosing
among them. Normalized scores, however, can provide
an objective criterion for choosing among parameter
settings. 40 The problem with applying them to
parametric alignment is that the boundaries of
parameter-space regions cannot be predicted beforehand,
and the stochastic experiment required to
estimate $\lambda$ and $K$ with any accuracy for a single set of
parameters requires many minutes of computational
time on a standard current workstation.

An alternative approach is to precompute $\lambda$ and $K$
for many points placed regularly through a reasonable
region of gap cost space. One may then simply
calculate the optimal alignment score for each gap
cost setting, and return those costs and the associated
alignment that yield the highest normalized
score, and thus the most significant result. One
disadvantage is that there is no guarantee that the
preselected gap cost settings include ones that are
even near optimal for the problem at hand. Furthermore,
it must be recognized that, while one may use
the normalized score of Equation (1) as an objective
criterion for selecting a set of gap costs, it is improper
to use Equation (2) to calculate an E value from the
normalized score. The reason is that one has performed
multiple tests, and optimized among them. 40
One may calculate a conservative upper bound on
the E value by multiplying that derived from Equation
(2) by the number of parameter sets examined,
but, due to the high degree of correlation among
tests, this generally yields a gross overestimate.
However, if the same sets of gap costs are to be
examined repeatedly, it is possible but laborious to
estimate the parameters for the new extreme value
distribution that results from optimizing over the
normalized scores. 40
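
The selection procedure described above can be sketched in Python as follows. The function names and the dictionary layout are hypothetical. The two candidate parameter sets, their estimated $\lambda$ and $K$, and the raw scores are those quoted in the article for the BRCA1 and 53BP1 comparison. The returned E value is only the conservative upper bound discussed in the text, not a calibrated estimate.

```python
import math


def bit_score(raw_score, lam, K):
    """Equation (1): normalized score in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def best_gap_costs(candidates, search_space):
    """Pick the gap costs with the highest normalized score.

    Returns (costs, bits, e_upper_bound); the bound multiplies the
    Equation (2) value by the number of parameter sets examined.
    """
    scored = {costs: bit_score(raw, lam, K)
              for costs, (lam, K, raw) in candidates.items()}
    best = max(scored, key=scored.get)
    e_single = search_space / 2 ** scored[best]
    return best, scored[best], len(candidates) * e_single


# (lambda, K, raw score) for each precomputed gap cost setting.
candidates = {
    (97, 10, math.inf): (0.0286, 0.046, 723),
    (120, 10, 3): (0.0286, 0.041, 822),
}
costs, bits, e_bound = best_gap_costs(candidates, search_space=1.2e10)
```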






(a) 



BRCA1 1699 RTLKYFLGIAGGKWWSYFOTTQSIKERKMLNEHDFevrgdWNGRNHQGPKRAR 1753 

RT KYFL +A G VS+ WV S ++ N ++ ++ + +R 
53BP1 866 RTRKYFLCLASGIPCVSHVWVHDSCHANQLQNYRNY LLPAGYSLEEQRIL 915 

####################################### 

BRCA1 1754 ESQDRKifrgleiccyGPFTNMP TDQLEWMVQLC GASWKELSS 1797 

+ Q R+ PF N+ +DQ + ++L GA+ VK+ S 

53BP1 916 DWQPRE NPFQNLKvllvSDQQQNFLELWseilmtgGAASVKQHHS 960 



BRCA1 1798 FT LGTGVHPIVWQPDAwteDNGFHAIGQMCEAPWTREWVL 1839

++ GV +W P ++ + PW++EWV+

53BP1 961 SAhnkdIALGVFDVWTDPSC PASVLKCAEALQLPWSQEWVI 1003



(b) 



BRCA1 1699 RTLKYFLGIAGGKWWSYFWVTQSIKERKMLNEHDFevrgdvvngrnhqgpKRAR 1753 

RT KYFL +A G VS+ WV S ++ N ++ +R 

53BP1 866 RTRKYFLCLASGIPCVSHVWVHDSCHANQLQNYRNYllpagyslee QRIL 915 



BRCA1 1754 ESQDRKi-FRGLEIccygpftnmptdqlewmvqlcGASWKELSSft LGTG 1803 

+ Q R+ F+ L++ GA+ VK+ S ++ G 

53BP1 916 DWQPREnpFQNLKVllvsdqqqnflelwseilmtgGAASVKQHHSsahnkdIALG 970 



BRCA1 1804 VHPIVWQPDawtedngfhaiGQMCEAPWTREWVL 1839

V +W P ++ + PW++EWV+

53BP1 971 VFDVWTDPScpasvlkc AEALQLPWSQEWVI 1003



Fig. 4. Two alignments of the C-terminal domain of human
breast cancer type 1 susceptibility protein (BRCA1; SWISS-PROT
accession number P38398), and a fragment of the human p53
binding protein 1 (53BP1; GenBank 50 accession number U09477).
Uppercase is used for aligned residues and lowercase for gapped
residues. a: The optimal local alignment, using (97, 10, $\infty$) gap
costs, has score 34.3 bits (raw score 723). Pound signs indicate a
region where this alignment diverges substantially from that below.
b: The optimal local alignment, using (120, 10, 3) gap costs, has
score 38.5 bits (raw score 822).



Are there any theoretical considerations that can 
guide the choice of gap costs? Recall that even when 
no insertions or deletions need to be invoked, gener- 
alized affine gap costs may leave unaligned a di- 
verged region that separates two related ones. One 
may calculate the approximate minimum length 
such a region needs to have before leaving it un- 
aligned becomes profitable. First, from the substitu- 
tion matrix used and the background amino acid 
frequencies, the expected score s for aligning two 
random residues may be calculated. Then for each 
pair of unrelated residues left unaligned, one gains 
on average a score of $-s - c$. However, to realize this
gain, one must pay a gap opening penalty of a. Thus,
on average, it is beneficial to leave unaligned two
unrelated segments when they are of length at least
$l = a/(-s - c)$. (Of course if a gap needs to be introduced
in any case due to an insertion or deletion, it
pays to leave any contiguous, diverged segments
unaligned.) For the matrix of Figure 2, s is approximately
$-10$, so for (120, 10, 3) generalized affine gap
costs, $l \approx 17$. It is evident from this analysis that one
of the main reasons for using generalized affine gap
costs is substantially lost if c is greater than $-s$, or
even quite close to it.
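
The break-even length can be computed directly. This Python sketch uses a hypothetical function name and takes the expected score s as a negative number, as the statistical theory requires, so that the average gain per unaligned pair is $-s - c$.

```python
import math


def min_profitable_length(a, c, s):
    """Approximate segment length beyond which leaving it unaligned pays off.

    `s` is the (negative) expected score for aligning two random residues.
    """
    gain_per_pair = -s - c          # avoid scoring s on average, pay c instead
    if gain_per_pair <= 0:
        return math.inf             # unaligned pairs never win on average
    return a / gain_per_pair


length = min_profitable_length(a=120, c=3, s=-10)   # 120 / 7
```

Raising c toward 10 in this example makes the break-even length grow without bound, which is the sense in which the benefit is lost when c approaches $-s$.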

Fig. 5. The greatest extent, within a path graph, of a gap with
cost less than X when (a) $c = \infty$, (b) $c = 2b$, (c) $c = b$, and (d) $c = b/3$.
Given a string of aligned residues, represented by a solid
diagonal line, the line representing the next string of aligned
residues must start within the region enclosed by dashed lines.

When the score X for a region within an alignment
is sufficiently negative, it generally makes
more sense to break the alignment into two separate
ones. 28,53 If one imagines the end of a given pair of
aligned segments to be fixed, it is then possible to
define the maximum extent of a gap that imposes a
cost less than X. The shape, within a path graph, of
this allowable gap region will depend on the relative
values of the gap cost parameters b and c (Fig. 5),
while its size will depend additionally on a and the
nominal score X. For example, using (120, 10, 3) gap
costs, with $\lambda \approx 0.0286$, for a gap to impose a penalty
of fewer than 15 bits it may have a nominal cost no
greater than 363. If no residue pairs are left unaligned,
the maximal number of inserted or deleted
residues is then 24, while if no residues are inserted
or deleted, the maximal number of unaligned pairs is
81. Given a sense of the maximal desirable extent of
a gap, one may be guided by such calculations in
one's choice of gap costs.
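
The worked numbers in this paragraph can be reproduced with the short Python sketch below. The function names are hypothetical, and the conversion from bits to nominal cost assumes that a penalty of x bits corresponds to a raw cost of x ln 2 / $\lambda$, as Equation (1) implies for score differences.

```python
import math


def nominal_cost_limit(bits, lam):
    """Raw-score equivalent of a penalty of `bits` bits: bits * ln 2 / lambda."""
    return bits * math.log(2) / lam


def max_gap_extent(cost_limit, a, b, c):
    """Largest one-sided gap and largest run of unaligned pairs within cost_limit."""
    budget = cost_limit - a                      # what remains after opening the gap
    return math.floor(budget / b), math.floor(budget / c)


limit = nominal_cost_limit(15, lam=0.0286)
max_indel, max_pairs = max_gap_extent(limit, a=120, b=10, c=3)
```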

CONCLUSION 

We have seen a number of cases in which generalized
affine gap costs improve somewhat the ability to
detect biological relationships, as well as to construct
biologically accurate alignments. Whether these costs
should be incorporated into database search programs
such as Fasta, 54 or gapped versions of
BLAST, 28,43 depends on whether the slight increase
in sensitivity is deemed worth the slight decrease in
speed. For a program such as PSI-BLAST, 43 however,
generalized affine gap costs may offer a more substantial
improvement, because better alignment accuracy
in the output from one database search can
engender a more sensitive position-specific score
matrix for the next.

There are other sequence comparison formalisms
into which generalized affine gap costs might be
incorporated. Many multiple alignment programs
depend on a progressive alignment strategy, in which
at first two and then greater numbers of sequences
are coalesced into a single alignment. 55-63 One difficulty
with this approach is that alignments formed
early in the process are constructed in ignorance of
most of the available data, and therefore may easily
freeze in a mistake. By permitting poorly conserved
regions to be left unaligned, generalized affine gap
costs may partially mitigate this problem. However,
extending generalized affine gap costs to multiple
alignments undoubtedly will entail unforeseen technical
difficulties, both definitional and algorithmic. 64
Also, the increasingly studied Hidden Markov Model
formalism for representing protein families 65-68 may
be able to subsume all that generalized affine gap
costs can offer to the multiple alignment problem.

As the original motivation for generalized affine
gap costs suggests, nonaligned regions may correspond
to loops separating modular or secondary
structural elements. The literature on protein
secondary- and tertiary-structure prediction is too large to
be reviewed here. However, ideas roughly corresponding
to the one studied above are already in common
use (e.g., Ref. 14). Thus it is unlikely that generalized
affine gap costs have much to offer structural
analysis. Because the field of pairwise sequence
comparison has been fairly thoroughly plowed, one is
accustomed to trying to generalize its ideas to multiple
and structural alignment. It is worth recognizing
that here the generalization has proceeded in the
opposite direction.